This week we are launching into RStudio and R Markdown. You should have R, RStudio, and RTools (Windows only) installed on your machine by now. Because R Notebooks and R Markdown will be the primary platform for writing and sharing code in this class, it is a good idea for us to build familiarity with them ASAP. This exercise is whimsical, but it also introduces several of the R Markdown formatting conventions that we will come to rely on and gives you a reason to practice what you read for today’s class. To wit, this HTML file was produced by knitting the .Rmd file into the format specified in the YAML header; this will typically be an HTML file for us.
Some people think that I only like nerdy comics like xkcd, but they are mistaken. I also have a deep appreciation for what has come before. Take, for instance, Garfield by Jim Davis.
The table below lists the main characters, for those who may be Garfield noobs!
| Name | Description |
|---|---|
| Garfield | Cat |
| Odie | Dog |
| Jon | Human |
| Nermal | Cat? |
There are several R packages designed to help you create better-looking tables in R Markdown, and we will introduce a couple of those over the coming weeks (e.g., the kable function from the knitr package).
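As a quick preview, here is a minimal sketch of how the character table above might be produced with knitr::kable. The `characters` data frame and the caption text are made up for illustration, and this assumes the knitr package is installed.

```r
library(knitr)

# Toy data frame mirroring the character table above (illustrative only)
characters <- data.frame(
  Name = c("Garfield", "Odie", "Jon", "Nermal"),
  Description = c("Cat", "Dog", "Human", "Cat?")
)

# kable() turns a data frame into a formatted table when the document is knit
character_table <- kable(characters, caption = "Main Garfield characters")
character_table
```

Inside an .Rmd file, this chunk would render as a formatted table in the knitted output rather than as raw console text.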
There are many reasons that Garfield is a great character. Allow me to explain in bulleted list form…
There are lots of interesting things that you may not know about Garfield that you probably should know about Garfield. Here’s one… Did you know that Muncie, Indiana is the setting for the comic strip? Muncie also happens to be the home of Ball State University.
The image above is from last week, so this strip is still going strong!
I have already mentioned xkcd and Garfield, but there are sooooo many other strips and web comics out there these days! Here are a few other faves:
- Poorly Drawn Lines by Reza Farazmand
- Bizarro by Dan Piraro
- Deliberately Buried by Sean ???
Some of you may also be fans of comic books, and the Marvel Cinematic Universe has really become embedded in American popular culture. The data in the file marvel-wikia-data.csv were scraped in 2014 and used in a FiveThirtyEight story about gender bias in the comic book industry. I realize that we have not yet introduced the dplyr package, but you did look at the piece on the readr package today. Let’s import this comic book characters dataset and poke around a little…
# install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
data_0 <- read_csv("marvel-wikia-data.csv")
## Rows: 16376 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, urlslug, ID, ALIGN, EYE, HAIR, SEX, GSM, ALIVE, FIRST APPEAR...
## dbl (3): page_id, APPEARANCES, Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data_0)
## spec_tbl_df [16,376 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ page_id : num [1:16376] 1678 7139 64786 1868 2460 ...
## $ name : chr [1:16376] "Spider-Man (Peter Parker)" "Captain America (Steven Rogers)" "Wolverine (James \\\"Logan\\\" Howlett)" "Iron Man (Anthony \\\"Tony\\\" Stark)" ...
## $ urlslug : chr [1:16376] "\\/Spider-Man_(Peter_Parker)" "\\/Captain_America_(Steven_Rogers)" "\\/Wolverine_(James_%22Logan%22_Howlett)" "\\/Iron_Man_(Anthony_%22Tony%22_Stark)" ...
## $ ID : chr [1:16376] "Secret Identity" "Public Identity" "Public Identity" "Public Identity" ...
## $ ALIGN : chr [1:16376] "Good Characters" "Good Characters" "Neutral Characters" "Good Characters" ...
## $ EYE : chr [1:16376] "Hazel Eyes" "Blue Eyes" "Blue Eyes" "Blue Eyes" ...
## $ HAIR : chr [1:16376] "Brown Hair" "White Hair" "Black Hair" "Black Hair" ...
## $ SEX : chr [1:16376] "Male Characters" "Male Characters" "Male Characters" "Male Characters" ...
## $ GSM : chr [1:16376] NA NA NA NA ...
## $ ALIVE : chr [1:16376] "Living Characters" "Living Characters" "Living Characters" "Living Characters" ...
## $ APPEARANCES : num [1:16376] 4043 3360 3061 2961 2258 ...
## $ FIRST APPEARANCE: chr [1:16376] "Aug-62" "Mar-41" "Oct-74" "Mar-63" ...
## $ Year : num [1:16376] 1962 1941 1974 1963 1950 ...
## - attr(*, "spec")=
## .. cols(
## .. page_id = col_double(),
## .. name = col_character(),
## .. urlslug = col_character(),
## .. ID = col_character(),
## .. ALIGN = col_character(),
## .. EYE = col_character(),
## .. HAIR = col_character(),
## .. SEX = col_character(),
## .. GSM = col_character(),
## .. ALIVE = col_character(),
## .. APPEARANCES = col_double(),
## .. `FIRST APPEARANCE` = col_character(),
## .. Year = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
View(data_0)
The read_csv function is part of the
readr package and we use it to import—you guessed
it—.csv files. Note that the str function displays the
structure of an object while the View function allows us to
interact with the data in a separate window.
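If you want to experiment with read_csv and str without the full file, here is a small sketch. The two rows below are invented stand-ins, not real records from the dataset, and wrapping a string in I() to treat it as literal data assumes readr 2.0 or later.

```r
library(readr)

# A tiny made-up stand-in for the real file; I() marks the string as literal data
toy_csv <- I(paste0(
  "name,SEX,FIRST APPEARANCE,Year\n",
  "Hero A,Female Characters,Aug-62,1962\n",
  "Hero B,,Mar-41,1941\n"
))

# show_col_types = FALSE quiets the column specification message
toy <- read_csv(toy_csv, show_col_types = FALSE)

# Structure of the object: column names, types, and the first few values
str(toy)

# Column names that contain spaces must be wrapped in backticks
toy$`FIRST APPEARANCE`
```

Notice that the empty SEX field for the second row is read in as NA, which is exactly the kind of missing value we deal with next.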
The reading for next time goes into further detail regarding the dplyr package, which is the workhorse for data wrangling. It is fundamental to the work we will do this semester, so we may as well start familiarizing ourselves with it.
data <- data_0 %>%
  drop_na(SEX)
dim(data_0)
## [1] 16376 13
dim(data)
## [1] 15522 13
data %>%
  group_by(SEX) %>%
  count()
data %>%
  count(SEX) %>%
  mutate(percent = (n / sum(n)) * 100)
In the above code chunk, we use the tidyr::drop_na
function to remove observations (i.e., rows) in the dataset that do not
have a value for the SEX attribute. We then use the
base R function dim, which is short for dimensions, to
compare the number of rows and columns before and after we perform that
operation.
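To see what drop_na is doing on a scale small enough to check by eye, here is a sketch with a made-up four-row tibble (the values are invented, and this assumes the tidyr and tibble packages are installed).

```r
library(tidyr)
library(tibble)

# Made-up data: two rows are missing SEX, one row is missing Year
toy <- tibble(
  name = c("A", "B", "C", "D"),
  SEX  = c("Female Characters", NA, "Male Characters", NA),
  Year = c(1962, 1941, NA, 1974)
)

# Only the SEX column is checked, so the row with a missing Year survives
toy_clean <- drop_na(toy, SEX)

dim(toy)        # rows and columns before
dim(toy_clean)  # rows and columns after

# Number of rows removed by the operation
nrow(toy) - nrow(toy_clean)
```

Calling drop_na(toy) with no column names would instead remove every row that has a missing value in any column.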
Next, we use the pipe operator %>% to link multiple
functions together. This allows us to write fewer lines of code and
(arguably) makes it easier to understand what is happening! You read the
code from top to bottom. The data object above has the
dplyr::group_by function applied to it such that
observations (i.e., rows) are grouped according to the value of the SEX
attribute, then the result is passed to the dplyr::count
function. Because there is no <- operator, the resulting
table is displayed but it is not stored in an object
that we can go back to later. By default, the count
function generates a new attribute (i.e., column) called
n which can be referenced in subsequent functions that
are part of the sequence and linked through the %>%
operator.
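Here is a small sketch of the difference between displaying a piped result and storing it. The three-row `toy` tibble and the `sex_counts` name are made up for illustration.

```r
library(dplyr)

# Made-up data with one female and two male characters
toy <- tibble(SEX = c("Female Characters", "Male Characters", "Male Characters"))

# Displayed, but not stored anywhere
toy %>%
  group_by(SEX) %>%
  count()

# The same computation without the pipe: nested calls are read inside-out
count(group_by(toy, SEX))

# Stored in an object that we can come back to later
sex_counts <- toy %>%
  group_by(SEX) %>%
  count()

# The n column created by count() is now available for later use
sex_counts$n
```

The piped and nested versions produce the same table; the pipe just lets you read the steps in the order they happen.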
The final bit of the above code chunk eliminates the
group_by component and instead applies the
count function directly to the SEX
attribute. Then, the dplyr::mutate function is used
to create a new attribute (i.e., column) alongside the
n attribute that contains the percentage value. Again,
because there is no <- operator, the resulting table is
displayed but it is not stored in an object that we can
go back to later.
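One reason for dropping group_by before computing percentages is worth seeing on toy data. In this sketch (made-up values, hypothetical object names), the result of group_by followed by count is still grouped, so sum(n) is computed within each group.

```r
library(dplyr)

# Made-up data: three male characters and one female character
toy <- tibble(SEX = c(rep("Male Characters", 3), "Female Characters"))

# count(SEX) returns an ungrouped table, so sum(n) is the overall total
sex_share <- toy %>%
  count(SEX) %>%
  mutate(percent = (n / sum(n)) * 100)
sex_share

# After group_by() + count() the table is still grouped by SEX,
# so sum(n) is taken per group and every percentage comes out as 100
grouped_share <- toy %>%
  group_by(SEX) %>%
  count() %>%
  mutate(percent = (n / sum(n)) * 100)
grouped_share
```

Adding ungroup() before mutate in the second pipeline would make it agree with the first.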
This quick analysis shows that there are far fewer female characters than male characters, but we could also ask if the number of appearances is more or less skewed.
Insert a new code chunk then try to modify the preceding code to determine:
Keeping in mind that these data are circa 2014, you can limit the
dataset to characters introduced in 2010 or later like this:
data_2010_2014 <- data %>% filter(Year >= 2010)
Hint: you will probably want to create a standalone
object that contains the total number of appearances for use as the
denominator in your calculations. You can access the function reference
for the dplyr package here.
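To illustrate the hint, here is one possible shape for the calculation, shown on a made-up four-row tibble rather than the real data. The object names (`recent`, `total_appearances`, `appearance_share`) are hypothetical, and this is a sketch of the pattern, not the only way to do it.

```r
library(dplyr)

# Made-up data; one character has no recorded appearance count
toy <- tibble(
  SEX = c("Female Characters", "Male Characters",
          "Male Characters", "Female Characters"),
  APPEARANCES = c(10, 30, NA, 60),
  Year = c(2011, 2012, 2009, 2013)
)

# Keep only characters introduced in 2010 or later
recent <- toy %>%
  filter(Year >= 2010)

# Standalone denominator; na.rm = TRUE skips missing appearance counts
total_appearances <- sum(recent$APPEARANCES, na.rm = TRUE)

# Sum appearances within each group, then divide by the overall total
appearance_share <- recent %>%
  group_by(SEX) %>%
  summarise(appearances = sum(APPEARANCES, na.rm = TRUE)) %>%
  mutate(percent = appearances / total_appearances * 100)
appearance_share
```

Storing the total in its own object means the denominator stays fixed no matter how the data are grouped afterward.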
If you are hungry for more dplyr, try rerunning the same code for the DC Comics dataset included with this assignment (i.e., dc-wikia-data.csv).
The last thing I want to introduce here is the rationale for installing RTools (or Xcode Command Line Tools if you have a Mac). Sometimes we want to access R packages that are not available on the official CRAN mirror sites, and usually that means downloading and compiling from a platform like GitHub. In the chunk below, we set our preferred CRAN mirror in the code, then install the devtools package, which allows us to pull packages like emo from GitHub. The emo package allows us to insert emoji into our R Notebooks, which will really increase your enjoyment and quality of life.
options(repos=c(CRAN="https://mirrors.nics.utk.edu/cran/"))
install.packages("devtools")
## package 'devtools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\bw6xs\AppData\Local\Temp\RtmpoBVwQq\downloaded_packages
devtools::install_github("hadley/emo")
library(emo)
If we did not have RTools (or Xcode Command Line Tools if you have a Mac) installed, this part wouldn’t work and the mood would be decidedly 😢
Take a look at this page to get a sense of which keywords are associated with your fave emoji 😮 but you should know that if there are multiple emoji associated with a given keyword, the emo package randomly grabs one each time.
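If you knit a notebook repeatedly, you may not want to reinstall packages every time. Here is a minimal sketch of one way to guard an install step. The `install_if_missing` helper is hypothetical (it is not part of devtools or emo), and the actual install line is commented out because it needs network access and build tools.

```r
# Hypothetical helper: run `installer` only when `pkg` is not already available
install_if_missing <- function(pkg, installer) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  installer()
  invisible(TRUE)
}

# Not run here; uncomment to install emo from GitHub only when it is missing
# install_if_missing("emo", function() devtools::install_github("hadley/emo"))

# Only call emo if it is actually installed on this machine
if (requireNamespace("emo", quietly = TRUE)) {
  print(emo::ji("cry"))
}
```

The helper returns FALSE when the package was already present and TRUE when the installer had to run.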